ABSTRACT


This article reviews several proposed mechanisms of molecular evolution operating in non-coding regions of the 
chloroplast genome and argues that awareness and identification of these mechanisms are essential for improving 
alignment and phylogenetic analysis of non-coding sequence data. The mechanisms are of five categories: (1) slipped- 
strand mispairing; (2) insertions and deletions linked with secondary structure formations; (3) inversions associated 
with hairpins and stem-loop structures; (4) localized or extra-regional intramolecular recombination; and (5) nucleotide 
substitutions. These mutations seem to be largely a function of sequence structure and pattern and may be highly 
homoplasious in a parsimony topology; therefore, mutations in non-coding regions of the chloroplast genome are de- 
scribed here as structured, nonrandom, and non-independent events. Established methodologies are based in large part 
on a collective understanding of genic DNA evolution and may need modification when applied to non-coding sequence 
data. Here I suggest an approach to the phylogenetic study of non-coding cpDNA that incorporates identification of 
mutational mechanisms in alignment and homology assessment of indels. I also discuss repercussions of non-coding 
sequence evolution for such aspects of phylogeny estimation as maximum likelihood, distance, and parsimony analysis, 
the inclusion of indels as phylogenetic characters, and bootstrapping, jackknifing, and “decay” analysis as measures
of clade support.

There is growing interest in comparative analysis
of non-coding chloroplast DNA (non-coding cpDNA)
sequences for plant systematic studies at low
taxonomic levels. Recognition of the limitations of
coding (genic) DNA for resolving relationships at these
levels inspired the probing of chloroplast introns 
and intergenic spacers for phylogenetic utility. Un- 
derlying this effort was the reasonable premise that 
non-coding regions experience limited or no selec- 
tive pressure and are likely to evolve at rates far 
surpassing those of genic regions (e.g., Curtis & 
Clegg, 1984; Wolfe et al., 1987; Palmer, 1987, 
1991; Olmstead & Palmer, 1994; Böhle et al.,
1994). There was also an expectation that non-coding
regions should experience random and independent
mutations, both in mode and distribution.

For these reasons, a remarkable number of plant 
systematics studies currently in progress include a 
molecular component of comparative analysis of 
non-coding cpDNA sequences. A considerable 
amount of work already published has demonstrated
the potential phylogenetic utility of discrete non-coding
regions in the chloroplast: the trnL-trnF
spacer (e.g., Gielly & Taberlet, 1994; Mes & t’Hart,
1994; van Ham et al., 1994; Sang et al., 1997; Cros
et al., 1998; Bayer & Starr, 1998), the trnT-trnL
spacer (Böhle et al., 1994, 1997; Small et al.,
1998), the rpoA-petD and rps11-rpoA spacers
(Peterson & Seberg, 1997), the atpB-rbcL spacer
(Golenberg et al., 1993; Hodges & Arnold, 1994;
Natali et al., 1995; Samuel et al., 1997; Savolainen et
al., 1997; Setoguchi et al., 1997; Hoot & Douglas,
1998), the rbcL-psaI spacer (Morton & Clegg,
1993), the psbA-trnH spacer (Aldrich et al., 1988;
Sang et al., 1997), the accD-psaI spacer (Small et
al., 1998), the rpl16-rpl14 and rps8-rpl14 spacers
(Wolfson et al., 1991), the intron surrounding matK
(Johnson & Soltis, 1994), the rpoC1 intron (Downie
et al., 1996a, 1996b; Asmussen & Liston, 1998;
Downie et al., 1998), the rpl16 intron (Jordan et al.,
1996; Kelchner, 1996; Kelchner & Clark, 1997;
Schnabel & Wendel, 1998; Baum et al., 1998;
Small et al., 1998), the trnL intron (Sang et al.,
1997; Bayer & Starr, 1998; Kajita et al., 1998;
Bayer et al., 2000), the rps16 intron (Liden et al., 1997;
Oxelman et al., 1997), and the ndhA intron (Small
et al., 1998).

The literature above not only reveals profound 
differences between the evolution of non-genic and 
genic cpDNA, but critically contradicts initial as- 
sumptions of constraint-free evolution in non-cod- 
ing regions. Recurring difficulties associated with 
non-coding sequence data include alternative 
alignment possibilities of insertions and deletions 
(indels), regions of length mutation in which ho- 
mology assessment is questionable or impossible, 
and the occurrence of localized “hot spots” of in- 
ferred excessive mutation, frequently to the point 
of saturation and loss of phylogenetic signal. How
best to proceed with the phylogenetic analysis of
such regions should be a topic of considerable concern
(see Golenberg et al., 1993; Downie et al.,
1996a; Kelchner & Clark, 1997; Sang et al., 1997; 
Downie et al., 1998). 

It is now evident that sequence evolution in non- 
coding regions of the chloroplast is far more com- 
plex than previously supposed. Both introns and 
intergenic spacers are thought to embody a consid- 
erable degree of sequence structure, sometimes in 
a manner similar to that of ribosomal DNA (rDNA). 
This structure may generate either regionalized se- 
quence conservation or mutational hot spots of both 
nucleotide substitutions and insertion/deletion
events. Sequence-directed initiators of mutational 
events may persist as “mutational triggers” (Kel- 
chner, 1996; Kelchner & Clark, 1997), dramatical- 
ly increasing the possibility of reversal or parallel 
gain of mutations, particularly length mutations or 
minute inversions. Hence, there exist essential vi- 
olations of the assumptions of randomized and in- 
dependent character evolution embedded in much 
of the current phylogenetic methodology for com- 
parative sequence analysis—methodology that is 
based largely on observational comparative study 
of coding sequence data. Considering that these are 
today’s commonly employed tools for phylogeny es- 
timation based on DNA sequences, there has been 
as yet remarkably little controversy in the literature 
about their application to non-genic sequence data. 

There are ways to account for mutational patterns 
observed in non-coding DNA. Comparative studies 
of non-coding cpDNA sequences during the past 
decade in particular (e.g., Palmer, 1985; Blasko et 
al., 1988; vom Stein & Hatchel, 1988; Wolfson et 
al., 1991; Golenberg et al., 1993; Gielly & Taberlet, 
1994; Morton, 1995a; Downie et al., 1996a; Kel- 
chner & Wendel, 1996; Kelchner & Clark, 1997; 
Sang et al., 1997) have allowed inference of specific
underlying mutational mechanisms responsible
for generating sequence diversity in non-coding
regions of the chloroplast genome. Unfortunately,
these mechanisms are often invoked but rarely
incorporated into the analysis.

Recognition of the potential of structured molec- 
ular evolution in non-coding cpDNA regions to im- 
prove alignment and assessment of phylogenetic re- 
lationships is, I believe, critical for the 
development of functional molecular systematic re- 
search based on non-coding sequence data. Toward 
this end, I endeavor here to illustrate the following: 
(1) non-coding regions are highly structured and 
their elements evolve non-randomly and non-in- 
dependently; (2) this structure may be used to align 
the sequence matrix and better assess homology; 
(3) the resulting gaps in the aligned matrix may 
contain phylogenetically important information and 
should be used in a phylogenetic analysis; and (4) 
the mode of non-coding sequence evolution de- 
scribed here may have potentially serious reper- 
cussions for the accuracy of genetic-distance, max- 
imum likelihood, and parsimony analyses, and for 
bootstrapping and jackknifing techniques. A de- 
scription of proposed mechanisms of non-coding 
sequence evolution is followed by a discussion of 
the appropriateness of current alignment and anal- 
ysis procedures, with the expectation that it may 
provide a more informed approach to the applica- 
tion of non-coding sequence data in plant system- 
atics research. 

This article is not intended to be a complete re- 
view of literature pertaining to the evolution of in- 
trons and intergenic spacers in all genomes of an 
organism. Instead, it serves as a brief review of 
current literature on non-coding cpDNA regions, 
and summarizes mutational mechanisms suggested 
to occur in these regions. Discussed are some of 
the serious implications this manner of molecular 
evolution has for the assumptions underlying mod- 
els employed today by plant molecular systematists. 


MECHANISMS OF NON-CODING SEQUENCE 
EVOLUTION 


The strength of any phylogenetic estimation rests 
on the accuracy of character homology assessment. 
Thus, the molecular systematist strives to maximize 
character homology by the careful alignment of 
DNA sequences in a data matrix. Fundamental to 
any alignment procedure of non-coding cpDNA se- 
quence data should be a familiarity with mutational 
mechanisms directing molecular evolution in non- 
coding regions. Recognition of these mechanisms 
as generators of specific mutations can be a pow- 
erful tool for the placement of gaps and for the 
assessment of probable homology of insertions and
deletions (Kelchner, 1996; Kelchner & Clark, 
1997). 


SLIPPED-STRAND MISPAIRING (SSM) 


A widely reported mechanism of length mutation 
in non-coding regions of the chloroplast is slipped- 
strand mispairing (SSM). SSM is thought to be a 
major, even principal, factor in length mutations 
within non-coding regions of the chloroplast, mi- 
tochondrial, and nuclear genomes (e.g., Levinson & 
Gutman, 1987; Hancock, 1995; Wolfson et al., 
1991; Kelchner & Clark, 1997; Sang et al., 1997). 
Length mutations are important components of non- 
coding sequence evolution and have been suggest- 
ed to occur at least as frequently as base substi- 
tutions in some chloroplast non-coding regions 
(Curtis & Clegg, 1984; Wolfe et al., 1987; Zurawski 
& Clegg, 1987; Clegg & Zurawski, 1992; Golen- 
berg et al., 1993; Gielly & Taberlet, 1994; Clegg 
et al., 1994). 

Slipped-strand mispairing is thought to proceed 
by a localized mispairing of single-stranded DNA 
in regions of sequence repeats, as either a string of 
mononucleotide repeats or tandemly arranged mul- 
tibase repeat units (Palmer, 1991; Wolfson et al., 
1991; Cummings et al., 1994; Hancock, 1995; re- 
viewed by Levinson & Gutman, 1987). Diagrams of 
proposed SSM mechanics can be found in Levinson 
and Gutman (1987) and Wolfson et al. (1991). Be- 
cause A/T-rich regions of bacterial genomes are 
particularly susceptible to slipped-strand mispair- 
ing (Levinson & Gutman, 1987), one could expect 
a similar effect in the A/T-rich non-coding regions 
of the chloroplast genome (Wolfson et al., 1991). 
This is not to imply that SSM acts uniquely on A 
and T nucleotides; aligned non-coding sequence 
matrices often infer inserted repeats containing G 
and C nucleotides, sometimes as pure strings of G 
or C mononucleotide repeats. 

Strings of mononucleotide repeats, particularly of 
A or T, appear frequently in non-coding cpDNA, 
and slipped-strand mispairing may potentially gen- 
erate length mutations within these strings. The dif- 
ficulty in assessing homology of length variation in 
long strings of repeats, whether mononucleotide or 
multinucleotide repeats, derives from the increas- 
ing potential for further length mutation relative to 
string length (Streisinger & Owen, 1985; Golenberg 
et al., 1993; Kelchner & Clark, 1997; Sang et al., 
1997). Subsequent SSM activity may either generate
additional repeats of the initial sequence or delete
sequence susceptible to slipped-strand mispairing.
Perhaps an equilibrium might exist
between the probability of inserting subsequent
length mutations and the probability of removing 
sequence from the repeat string. Whether such an 
equilibrium is present or not, there may be a com- 
petitive phenomenon that keeps the length of tan- 
dem repeated sequence units continually in flux. 
Representation of long repeat strings in non-coding 
sequence alignments would therefore be a “snap- 
shot” of sequences experiencing continual inser- 
tions and deletions at that locality. 
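
As a rough illustration of how such repeat strings might be flagged before alignment, the following sketch scans an ungapped sequence for long mononucleotide runs. The function name and the threshold of eight bases are assumptions made for this example only; no particular cutoff is implied by the studies cited above.

```python
import re


def find_mononucleotide_runs(seq, min_len=8):
    """Return (start, base, length) for every run of a single base
    that is at least min_len nucleotides long.

    min_len is an arbitrary, illustrative threshold.
    """
    # ([ACGT]) captures one base; \1{n,} requires n or more further copies.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    example = "GGA" + "T" * 10 + "CA"
    for start, base, length in find_mononucleotide_runs(example):
        print(f"poly-{base} run of {length} bp starting at position {start}")
```

Runs reported by such a scan are candidates for regions where length variation may be too labile for confident homology assessment; they are not evidence of SSM in themselves.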

It follows that a point substitution within a long 
string of mononucleotide repeat units could act as 
a stabilizing factor, disrupting its previous unifor- 
mity and lowering the probability of further SSM 
events. Such a substitution would directly influence 
ensuing mutations in the region and is one example 
of a non-independent character mutation in non- 
coding DNA. If the situation were reversed, with a 
non-homogeneous sequence becoming a string of 
repeat units, the likelihood of an SSM event would 
increase and could induce further non-independent 
mutations by the addition or removal of repeated 
sequence by slipped-strand mispairing. 

As an aid to alignment, SSM-generated inser- 
tions and deletions can be used to position and 
determine number of gaps. A quick study of a re- 
peat unit or the flanking sequence of a gap may be 
enough to determine if slipped-strand mispairing is 
the likely progenitor of an observed length muta- 
tion. Occasionally, evidence of an SSM event may 
not be apparent, particularly if a deleted sequence 
is not a direct repeat of its flanking sequence, or if 
a subsequent length mutation due to another mechanism
obscures an earlier SSM event (Kelchner, 1996).
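
The "quick study" of flanking sequence described above can be sketched in a few lines. This is a minimal illustration, assuming the inserted sequence and its immediate flanks have already been extracted from the alignment; the function name is hypothetical, and only exact direct repeats are detected, so obscured or imperfect SSM events would be missed.

```python
def is_direct_repeat_of_flank(insert, upstream, downstream):
    """Return True if `insert` exactly duplicates the sequence that lies
    immediately upstream or downstream of the gap.

    A True result is consistent with slipped-strand mispairing;
    a False result does not rule it out.
    """
    insert = insert.upper()
    if not insert:
        return False
    return (
        upstream.upper().endswith(insert)
        or downstream.upper().startswith(insert)
    )


if __name__ == "__main__":
    # Inserted TATTA directly follows an identical TATTA unit.
    print(is_direct_repeat_of_flank("TATTA", "GGCTATTA", "CCG"))
    # Inserted sequence shows no repeat relationship to either flank.
    print(is_direct_repeat_of_flank("GCGT", "AATT", "CCAA"))
```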


STEM-LOOP SECONDARY STRUCTURE 


Striking in both intergenic spacers and introns
in the chloroplast genome is the presence and num- 
ber of probable secondary structures referred to as 
“stem-loops.” Stem-loops are believed to occur dur- 
ing single-stranding events when inverted repeats 
meet to form a region of pairing (the stem) sur- 
mounted by their interceding sequence (the loop). 
Such structures have been widely discussed for ri- 
bosomal DNA, with ITS and 18S rDNA regions be- 
ing of particular interest to the plant systematist 
(see Baldwin et al. (1995), Soltis et al. (1997), and 
Soltis & Soltis (1998) for discussion of secondary 
structures in these regions and their phylogenetic 
implications). 

Probable stem-loop secondary structure is com- 
monly reported in non-coding regions of organellar 
genomes (e.g., Michel et al., 1989; Buroker et al., 
1990; Golenberg et al., 1993; van Ham et al., 1994;
Gielly & Taberlet, 1994; Natali et al., 1995; Rigaa 
et al., 1995; Downie et al., 1996b; Kelchner & 
Wendel, 1996; Kelchner & Clark, 1997; Sang et 
al., 1997; Downie et al., 1998). Gielly and Taberlet 
(1994) reported several probable stem-loops in the 
trnL-trnF region of the chloroplast genome, includ- 
ing nine highly probable structures within the trnL 
intron itself. All other introns in the chloroplast ge- 
nomes of land plants are classified as Group II in- 
trons and share a diagnostic secondary structure of 
six well-defined stem-loop domains (Kohchi et al., 
1988; Michel et al., 1989; Downie et al., 1996b; 
Downie et al., 1998). Diagrams of putative single- 
stranded secondary structure of introns may be 
found in Michel and Dujon (1983), Michel et al. 
(1989), and Downie et al. (1998). 

Loop regions of stem-loop secondary structures 
are often associated with hot spots for mutation in 
non-coding regions, both of nucleotide substitutions 
and indel events (vom Stein & Hatchel, 1988; Al- 
drich et al., 1988; Golenberg et al., 1993; Gielly & 
Taberlet, 1994; van Ham et al., 1994; Clegg et al., 
1994; Ferris et al., 1995; Downie et al., 1996b; 
Kelchner & Clark, 1997). Indels located in probable
loop sequence are frequently inserted or deleted
repeat units, likely the result of SSM. However,
length mutations not attributable to
slipped-strand mispairing often occur within loop
sequences as well and may be remnants of recombination
events.

Although indels are most common in the termi- 
nal loop, they may occur anywhere along a second- 
ary structure. For example, Kelchner and Clark 
(1997) detected what appeared to be an entire de- 
letion of a small sub-loop positioned partway up the 
stem of an rpl16 intron stem-loop in Oryza sativa.
Such side loops, when present, may be removed in 
some taxa without compromising the favorability of 
a stem formation. Occasionally, small segments of 
the stem itself will be deleted, decreasing the stem 
length, though perhaps not to an extent that would 
annihilate possible secondary structure formation. 

Very large loops are often associated with regions 
of chaotic or “labile” length variation characteristic 
of many non-coding cpDNA sequence matrices 
(e.g., Golenberg et al., 1993; Downie et al., 1996a; 
Soltis et al., 1996; Kelchner & Clark, 1997; Baum 
et al., 1998). Homology assessment here can be 
difficult or impossible, and the conservative ap- 
proach of removing these regions from the data ma- 
trix before phylogenetic analysis is frequently 
adopted. 

In contrast to the loop of stem-loop secondary
structures being highly susceptible to nucleotide
substitutions and length mutation, the inverted repeated
sequence composing the stem is frequently
conserved in character (Learn et al., 1992; Gielly 
& Taberlet, 1994; Downie et al., 1996a, 1996b; 
Kelchner & Clark, 1997), particularly when stems 
are long and possess highly favorable energy of formation
values (ΔG values; see Kelchner & Wendel,
1996; Dumolin-Lapègue et al., 1998). A sequence
involved in stem formation is less available for sub- 
stitution and length mutation because it is paired 
with its sister repeat; this can engender non-ran- 
domly and non-independently evolving sequence 
units. 

Similar to ribosomal RNA and rDNA secondary 
structure (e.g., Curtiss & Vournakis, 1984; Wheeler 
& Honeycutt, 1988; Dixon & Hillis, 1993; Soltis & 
Soltis, 1998), a nucleotide substitution occurring in 
a stem sequence of a non-coding cpDNA region 
could compromise secondary structure formation. 
Compensatory mutation may then occur to preserve 
the potential for structure formation (Kelchner, 
1996; Kelchner & Clark, 1997). Although se- 
quence conservation may be present merely as a 
function of sequence pattern (perhaps the case in 
intergenic spacers), the degree of secondary struc- 
ture conservation in a chloroplast Group II intron 
suggests secondary structures are integral to proper 
functioning of the intron (Clegg et al., 1986; Learn 
et al., 1992; Downie et al., 1996a). Experimental 
evidence has shown some of this structure is es- 
sential for auto-splicing mechanisms in Group I 
and II introns (Bonnard et al., 1984; Kohchi et al., 
1988; Dujon, 1989; Cech, 1990; Michel & Westhof, 
1990; Hibbett, 1996). 

Identification of probable secondary structure 
can be valuable when aligning and analyzing non- 
coding sequences by improving gap positioning and 
the appraisal of character homology. Gaps flanked 
by inverted repeats and regions relatively rich in G 
and C content are suspect as possible stems of sec- 
ondary structures. As noted, regions of chaotic 
length mutations are correlated with loops, so the 
boundaries of a chaotic region will frequently cor- 
respond with inverted repeats that can form a stem, 
even if they do not directly neighbor the chaotic 
region. Computer programs such as OLIGO (Ry- 
chlik & Rhoads, 1989), MULFOLD (Jaeger et al., 
1989; Zuker, 1989), and GCG’s Stemloop (Genetics 
Computer Group, Madison, Wisconsin) can assist 
in the detection of secondary structure in non-cod- 
ing sequences. A search can be conducted by hand, 
particularly if a published data set exists for the 
region. Free energy of formation values (ΔG) can
be calculated with some of the software listed above as an
appraisal of the likelihood of formation of a particular
secondary structure (see Kelchner & Wendel
(1996) for an example where ΔG values were applied
to parallel inversion events in their data).
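
A search for candidate stems can also be approximated with a short script. The sketch below is not a substitute for the programs named above: it reports only perfectly complementary inverted repeats of a fixed length and does not estimate ΔG. The function names, the stem length of six, and the loop-size limits are all assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an ungapped DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_stem_candidates(seq, stem_len=6, min_loop=3, max_loop=30):
    """Return (left_start, right_start, loop_len) for each pair of
    perfectly complementary inverted repeats of length stem_len.

    stem_len, min_loop, and max_loop are illustrative values only.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem_len - min_loop + 1):
        target = reverse_complement(seq[i:i + stem_len])
        for loop_len in range(min_loop, max_loop + 1):
            j = i + stem_len + loop_len
            if j + stem_len > len(seq):
                break
            if seq[j:j + stem_len] == target:
                hits.append((i, j, loop_len))
    return hits


if __name__ == "__main__":
    # Stem GGCATC, loop TTTA, complementary stem GATGCC.
    example = "ATGGCATCTTTAGATGCCAT"
    for left, right, loop_len in find_stem_candidates(example, max_loop=10):
        print(f"stem at {left} pairs with {right}; loop of {loop_len} bp")
```

Overlapping hits for the same structure are expected when the bases flanking a stem are themselves complementary; these would need to be merged by eye or by a further step not shown here.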


MINUTE INVERSIONS 


Minute inversions of four to six base pairs have 
been linked to small stem-loop secondary struc- 
tures commonly referred to as hairpins (Kelchner 
& Wendel, 1996). Hairpins consist of a stem com- 
posed of nearly adjacent inverted repeats producing 
a stem-loop structure with a particularly small loop. 
This loop may become inverted by recombination, 
and the inversion may be so small that it either 
escapes notice during alignment (Kelchner & Wen- 
del, 1996; Kelchner & Clark, 1997), or the inverted 
sequence matches particular bases of the uninvert- 
ed sequence, resulting in a confusing array of mi- 
nute gaps (see Golenberg et al., 1993). 

Identifying minute inversions can require careful 
attention when aligning sequence data, particularly 
if alternative gap weighting schemes of an align- 
ment program have not been rigorously explored. 
Candidates for a hidden inversion are several ad- 
jacent nucleotide substitutions, a series of tiny 
gaps, or a gap that demonstrates no repeat aspect 
to its sequence structure. Alternatively, one could 
investigate these probable secondary structures by 
hand or with a secondary structure computer pro- 
gram. Failure to recognize minute inversions in a 
sequence data set has several repercussions for 
phylogenetic analysis, discussed fully in Kelchner 
and Wendel (1996) and summarized here in Anal- 
ysis of Non-Coding Sequence Data. 
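
One way to screen for hidden inversions is to compare two sequences window by window and ask whether a differing segment in one is the reverse complement of the same segment in the other. The sketch below assumes two sequences of equal length with no gaps in the region examined; the function names are hypothetical, and the window sizes simply follow the four to six base pairs mentioned above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an ungapped DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_minute_inversions(seq_a, seq_b, min_len=4, max_len=6):
    """Return (start, size) for each window in which the two sequences
    differ and seq_b is the reverse complement of seq_a.

    Both sequences are assumed to be ungapped and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    a, b = seq_a.upper(), seq_b.upper()
    hits = []
    for size in range(min_len, max_len + 1):
        for start in range(len(a) - size + 1):
            window_a = a[start:start + size]
            window_b = b[start:start + size]
            if window_a != window_b and window_b == reverse_complement(window_a):
                hits.append((start, size))
    return hits


if __name__ == "__main__":
    # Hairpin with stem GCCGAT/ATCGGC; the loop AAGT is inverted to ACTT.
    uninverted = "GCCGATAAGTATCGGC"
    inverted = "GCCGATACTTATCGGC"
    print(find_minute_inversions(uninverted, inverted))
```

Because the stem bases flanking the loop are complementary to one another, more than one window may satisfy the test for a single event, so the exact boundaries of an inversion can remain ambiguous.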

Finally, small inversions associated with hairpins 
may be highly susceptible to reversal and parallel- 
ism within a study group, even at the interspecific 
level (Kelchner & Wendel, 1996; Kelchner & 
Clark, 1997; Sang et al., 1997; Dumolin-Lapègue
et al., 1998). This susceptibility to reversal or par- 
allelism is due to the persistence of the mutational 
trigger (Kelchner & Clark, 1997)—the nearly ad- 
jacent inverted repeats—after the initial inversion 
event. 


NUCLEOTIDE SUBSTITUTIONS 


Nucleotide substitutions are generally reported 
as being more common in non-coding than in cod- 
ing regions (Wolfe et al., 1987; Zurawski & Clegg, 
1987; Olmstead & Palmer, 1994; Hoot & Douglas, 
1998; however, see Sang et al., 1997, for an excep- 
tion). Surprisingly, a number of studies report nu- 
cleotide substitutions as being just equal to or less 
frequent than length mutations in closely related 
taxonomic groups (Curtis & Clegg, 1984; Wolfe et
al., 1987; Zurawski & Clegg, 1987; Clegg & Zurawski,
1992; Golenberg et al., 1993; Gielly & Taberlet,
1994; however, see Small et al., 1998).

Annals of the Missouri Botanical Garden

Percent AT content is quite variable in non-cod- 
ing cpDNA regions, though it is generally higher 
than the average value for the chloroplast genome 
(Shimada & Sugiura, 1991; Downie et al., 1996a;
Small et al., 1998). Because of their high AT con- 
tent, non-genic regions must make a significant 
contribution to the high overall frequency of A and 
T in the chloroplast genome. Kajita et al. (1998) 
reported an AT content of 67% in the trnL-trnF 
spacer and trnL intron, Kelchner and Clark (1997) 
reported 70.5% AT composition in the intron of 
chloroplast gene rpl16 in bamboos, and Small et al.
(1998) found an incredible 77.1% AT content in 
the intergenic spacer trnT-trnL in Gossypium. Un- 
doubtedly, this unequal tendency toward AT rich- 
ness in non-genic chloroplast DNA has several as 
yet undetermined implications for phylogenetic 
analysis of non-coding sequence data. At a mini- 
mum, it introduces a strong base composition bias 
into the analysis. 
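
Percent AT content of the kind reported above is simple to compute. The sketch below assumes that gap symbols and ambiguity codes should be excluded from the denominator, which is a choice made for this example and not necessarily the convention used in the cited studies.

```python
def at_content(seq):
    """Return percent A+T among unambiguous bases (A, C, G, T).

    Gaps and ambiguity codes are ignored; this is an assumed convention.
    """
    bases = [base for base in seq.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(base in "AT" for base in bases) / len(bases)


if __name__ == "__main__":
    print(f"{at_content('AATTGC--N'):.1f}% AT")
```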

Substitutions may demonstrate rather high levels 
of homoplasy in non-coding cpDNA regions due to 
the frequency of inferred multiple-hit sites (nucle- 
otide sites experiencing multiple substitution 
events). Multiple-hit sites occur even at very low 
estimates of percent sequence divergence (Kel- 
chner, 1996; Kelchner & Clark, 1997), suggesting 
that the accepted coding region estimates of 
“around 10-15%” sequence divergence for optimal 
phylogenetic signal may be inadequate measures 
for phylogenetic utility of a non-coding region. 
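
For reference, an uncorrected percent sequence divergence between two aligned sequences can be calculated as below. This is a minimal sketch that assumes positions containing a gap or ambiguity code in either sequence are skipped; it applies no correction for multiple hits, which is precisely why such a value may understate the amount of change in non-coding regions.

```python
def percent_divergence(seq_a, seq_b):
    """Return uncorrected percent divergence between two aligned sequences.

    Only positions with A, C, G, or T in both sequences are compared
    (an assumed convention). No multiple-hit correction is applied.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * differences / compared


if __name__ == "__main__":
    print(f"{percent_divergence('ACGT-ACGTA', 'ACGTTACGTC'):.2f}%")
```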

Precise understanding of mechanisms underlying 
multiple-hit substitutions in non-coding DNA is 
lacking. However, attributes of the molecular evo- 
lution of non-coding regions influence the manner 
of nucleotide mutation or the distribution of nucle- 
otide substitution events in an intron or intergenic 
spacer. Stem sequence and loop regions may dif- 
ferentially permit mutations, resulting in non-ran- 
domly distributed and non-independent nucleotide 
substitutions. Statistical significance of differential 
mutation rates in loops relative to stems may be 
tested for an adequate distribution model (see Olm- 
stead et al.’s (1998) test for stochastic mutation in 
the chloroplast genes ndhF and rbcL), yet has rare- 
ly, if ever, been performed on non-coding cpDNA 
data sets. 

In addition to secondary structure affecting the 
random distribution of nucleotide substitutions, 
there may be constraints on the type of mutation 
an individual site experiences. For example, there 
is a correlation between transition/transversion ratios
and neighboring base composition in non-coding
regions (Morton, 1995a, b; Morton et al., 1997;
Savolainen et al., 1997). The correlation suggests 
that nucleotides flanked by A and/or T will dem- 
onstrate a significant tendency toward transversion 
mutations. Such a tendency limits possible nucle- 
otide replacements at these sites, increasing the 
chance of parallelism and reversals, particularly if 
the site experiences multiple hits. One would also 
expect transversion substitutions to be more com- 
mon in data sets of high AT content. 
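
The relationship between substitution type and neighboring base composition can be explored with a simple tally. The sketch below is only a pairwise illustration and is not the method of the studies cited: it classifies each difference between two aligned sequences as a transition or a transversion and groups the counts by the number of A or T bases (0, 1, or 2) flanking the site in the first sequence. All names are assumptions made for this example.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Return 'transition', 'transversion', or None if not a substitution."""
    x, y = x.upper(), y.upper()
    if x == y or not {x, y} <= PURINES | PYRIMIDINES:
        return None
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def tally_by_flanking_at(seq_a, seq_b):
    """Count substitutions keyed by (number of A/T neighbours, type).

    Neighbours are read from seq_a; the first and last sites are skipped
    because they lack one flanking base.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    a, b = seq_a.upper(), seq_b.upper()
    counts = Counter()
    for i in range(1, len(a) - 1):
        kind = classify_substitution(a[i], b[i])
        if kind is None:
            continue
        n_at = sum(base in "AT" for base in (a[i - 1], a[i + 1]))
        counts[(n_at, kind)] += 1
    return dict(counts)


if __name__ == "__main__":
    print(tally_by_flanking_at("TATGCG", "TTTGTG"))
```

A pairwise comparison cannot tell which sequence carries the derived state, so a fuller analysis would map substitutions onto a tree before tallying them.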


INTRAMOLECULAR RECOMBINATION 


Intramolecular recombination on an extra-re- 
gional or genomic scale has been suggested be- 
tween adjacent or nearby repeats in the chloroplast 
genome (Howe, 1985; Palmer et al., 1985; Palmer 
et al., 1987; Blasko et al., 1988; Ogihara et al., 
1988; Milligan et al., 1989; Kanno & Hirai, 1992; 
Kanno et al., 1993; Morton & Clegg, 1993; Hoot & 
Palmer, 1994). In the context of non-coding se- 
quence comparison, such a large-scale recombi- 
nation involving the particular region of study could 
result in indels of surprising size that contain se- 
quence content not readily identifiable in origin. 

Recombination events may operate on a finer 
scale within a discrete non-coding region. Occa- 
sionally one infers extensive deleted sequence in 
an alignment with no apparent mechanistic expla- 
nation, presence of a small or moderately sized in- 
version, or a large insertion showing little congru- 
ence with surrounding sequence pattern. Such 
mutations suggest intramolecular recombination, 
and they frequently occur in the loop regions of 
probable secondary structures. Sequences involved 
in stem-loops may be particularly susceptible to re- 
combination events due to the conserved inverted 
repeats and mutationally flexible loop. Therefore, 
such structures could experience interactive recom- 
bination with other stem-loops, particularly with 
those existing in complementary sequence position. 

Recombination involving the entire loop of a sec- 
ondary structure may occur, particularly in struc- 
tures with long stems, resulting in minute or mod- 
erate-sized inversions in both intron and intergenic 
spacer regions (Natali et al., 1995; Kelchner & 
Wendel, 1996; Kelchner & Clark, 1997; Sang et 
al., 1997). Such incidents are often homoplasious 
(Kelchner & Wendel, 1996; Kelchner & Clark, 
1997; Sang et al., 1997; Dumolin-Lapègue et al.,
1998) due to the persistence of the mutational trig- 
ger; in this case, the hairpin stem. 

Intramolecular recombination is a notable alter- 
native to slipped-strand mispairing as a source for 
certain inserted or deleted tandem-repeat length
mutations (Palmer, 1985; Blasko et al., 1988). 
However, Wolfson et al. (1991), Sang et al. (1997), 
and Kelchner and Clark (1997) suggested SSM is 
a more likely mechanism for length mutation in 
their studies of chloroplast introns and intergenic 
spacers. 


ALIGNMENT 


There are many philosophies for sequence align- 
ment, and much of the literature centers on the 
proper application of computer software for this 
purpose. The structure present in a non-coding 
cpDNA sequence makes it an excellent example for 
discussing what I believe to be the fundamental 
problem of most computer alignment programs: de- 
fining the nucleotide as a discrete and independent 
character. The identification of secondary structure 
and mutational mechanisms in the data may greatly 
improve on current algorithmic alignments of gaps, 
and thus on assessment of character homology. 

Many have found software, particularly versions 
of CLUSTAL (Higgins et al., 1992; Thompson et 
al., 1994), to be of help at least initially with the 
alignment of non-coding sequences. The alignment 
is then subjected to an “improvement by hand” to 
position gaps (e.g., Samuel et al., 1997; Downie et 
al., 1998; Bayer & Starr, 1998; Kajita et al., 1998). 
This procedure saves time if the sequences are sim- 
ilar in length, but when indels become numerous 
in the data matrix the difficulties of alignment dra- 
matically increase. This is because most alignment 
software initially regards each character in the ma- 
trix as an independent unit, unless otherwise spec- 
ified by particular position or gap weighting 
schemes defined by the user. The software is in- 
capable of determining when mutations other than 
substitutions have arisen, such as non-independent 
insertions, deletions, or inversions correlated with 
SSM and secondary structure. Appropriate weight- 
ing for these mutations that could be incorporated 
into an alignment algorithm is, at present, unde- 
veloped. 

The Elision method of Wheeler et al. (1995) at- 
tempts to improve gap placement and indel homol- 
ogy by alignment software. The Elision method uses 
standard alignment algorithms to produce a series 
of competing alignments based on varying gap 
weighting schemes. These competing alignments 
are then combined in a single matrix and an anal- 
ysis is performed, with the effect that support is 
increased for aligned regions that most frequently 
appear among the various gap-weighting schemes. 
This method aims at objectivity, but makes no improvement
on the alignment algorithm’s inability to
assess mutation types other than independent point 
substitutions. Mutations in non-coding regions are 
influenced by surrounding sequence structure and 
frequently occur not as independent base mutations 
but as linked multinucleotide mutation events, like 
the insertion of a repeat unit (Kelchner, 1996; Kel- 
chner & Clark, 1997). The likelihood that many 
non-coding mutations are derived from sequence 
fragments that are inserted, deleted, inverted, or 
otherwise rearranged, negates the assumption of 
discrete, independent nucleotide characters under- 
lying all alignment algorithms, as well as any ex- 
tension of those algorithms like the Elision method. 
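
The combining step of the Elision method, as described above, amounts to concatenating the competing alignments taxon by taxon. The sketch below shows only that step, assuming each alignment is supplied as a dictionary mapping taxon names to aligned sequences; the function name and data layout are hypothetical, and the alignments themselves would come from an alignment program run under different gap-weighting schemes.

```python
def combine_alignments(alignments):
    """Concatenate competing alignments of the same taxa into one matrix.

    Each alignment is a dict of taxon name -> aligned sequence.
    All alignments must contain the same taxa, and the sequences within
    any one alignment must be of equal length.
    """
    if not alignments:
        raise ValueError("at least one alignment is required")
    taxa = list(alignments[0])
    for alignment in alignments:
        if set(alignment) != set(taxa):
            raise ValueError("alignments contain different taxa")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("sequences within an alignment differ in length")
    return {
        taxon: "".join(alignment[taxon] for alignment in alignments)
        for taxon in taxa
    }


if __name__ == "__main__":
    scheme_1 = {"taxon1": "ACG-T", "taxon2": "ACGGT"}
    scheme_2 = {"taxon1": "AC-GT", "taxon2": "ACGGT"}
    for taxon, row in combine_alignments([scheme_1, scheme_2]).items():
        print(taxon, row)
```

Columns that are aligned identically under every scheme are simply repeated in the combined matrix, which is how agreement among schemes translates into added support. The sketch also makes plain that every column is still treated as an independent character.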

At a minimum, those using sequence alignment 
programs to establish putative homology of char- 
acters in their data matrix should experiment with 
a wide variety of gap-weighting options. These options
may not reveal the underlying mutational
mechanisms occasioning sequence rearrangements
in chloroplast non-coding regions. They
may, however, facilitate the rapid alignment of segments
of the matrix that share consistent sequence
integrity and thus pinpoint regions of variable
length that require special consideration.

Alternatively, some have avoided alignment pro- 
grams entirely and describe aligning sequences by 
hand (e.g., Golenberg et al., 1993; Hodges & Ar- 
nold, 1994; Kelchner & Clark, 1997). This ap- 
proach facilitates a careful study of the matrix as 
it forms and increases the researcher's familiarity 
with mutations in the sequences. However, align- 
ment by hand, especially when dealing with con- 
siderably divergent taxa or with the presence of a 
great number of length mutations, can be tedious 
and time consuming. 

Kelchner and Clark (1997) suggested that aware- 
ness of the proposed mutational mechanisms active 
in non-coding regions can be useful for inferring 
and positioning gaps and ultimately in assessing 
homology. Golenberg et al. (1993) were the first to 
detail a criterion for aligning gaps in non-coding 
cpDNA matrices. Based on their example, Kel- 
chner (1996) and Kelchner and Clark (1997) mod- 
ified the alignment criterion for chloroplast rpl16 
intron sequences. Hoot and Douglas (1998) also re- 
vised Golenberg et al.’s (1993) method of gap align- 
ment, framing the beginnings of a nomenclatural 
procedure for defining gap categories. Although a 
nomenclatural system is not requisite for gap treat- 
ment in a phylogenetic analysis, it may be useful 
in collating information of inferred mutational 
mechanisms if universally applied in non-coding 

ALIGNMENT ISSUES: EXAMPLES FROM NON-CODING
cpDNA DATA


Here I present examples (Kelchner & Wendel, 
1996; Kelchner & Clark, 1997; Kelchner, unpub- 
lished data) to illustrate the inference of mutational 
mechanisms in non-coding cpDNA sequences and 
demonstrate the practice of applying mechanistic 
explanations to alignment and homology assess- 
ment. Nucleotides in lower-case bold print are in- 
ferred insertions; underlined nucleotides indicate 
the probable progenitor sequence of an insertion or, 
in Examples 4 and 5, call attention to a particular 
sequence of interest. 

A common type of insertion in non-coding
cpDNA is a direct repeat of a neighboring sequence
(“Type 1a” gap; Golenberg et al., 1993; Hoot &
Douglas, 1998). These often take the form of variable-length
strings of a mononucleotide repeat unit
(Example 1).

EXAMPLE 1.

1. TTAAAAAAAAA---TTGA
2. TTAAAAAAAAAA--TTGA
3. TTAAAAAAAA----TTGA
4. TTAAAAAAAAAAAATTGA


Homology can be highly uncertain for these re- 
peated nucleotides. Therefore, such regions are ei- 
ther removed from consideration as potential phy- 
logenetic characters (a conservative approach) or 
included as coded gap characters corresponding to 
length of the repeat string (often becoming highly 
homoplasious in the context of a resulting topology). 
Uncertainty of homology is exacerbated by potential 
inaccuracies of enzymatic processes during PCR 
amplification and sequencing, which can also gen- 
erate variable-length repeat strings independent of 
the template’s sequence constitution. When strings 
of adjacent mononucleotide repeats are highly var- 
iable in length in a matrix and reach or exceed the 
range demonstrated above, they become more likely 
to experience further SSM mutation. For this rea- 
son, it is perhaps most reasonable to remove such 
areas from consideration in a phylogenetic analysis. 
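
Regions of this kind can be located automatically before deciding whether to exclude them. The sketch below flags column blocks of an alignment that hold a single base (plus gaps) and whose run length differs among sequences. The function names and the threshold `min_len=8` are assumptions chosen to match Example 1, not published criteria, and rows are assumed to be upper-case.

```python
import re


def consensus(rows):
    """One symbol per column: the shared base, or N where rows disagree."""
    out = []
    for column in zip(*rows):
        bases = set(column) - {"-"}
        out.append(bases.pop() if len(bases) == 1 else "N")
    return "".join(out)


def variable_mononucleotide_runs(rows, min_len=8):
    """Return (start, end, run lengths per row) for length-variable runs."""
    pattern = "|".join(f"{b}{{{min_len},}}" for b in "ACGT")
    hits = []
    for m in re.finditer(pattern, consensus(rows)):
        lengths = [len(r[m.start():m.end()].replace("-", "")) for r in rows]
        if len(set(lengths)) > 1:  # the run differs in length among rows
            hits.append((m.start(), m.end(), lengths))
    return hits


# The four rows of Example 1, built from their A-run lengths (9, 10, 8, 12).
example_1 = ["TT" + "A" * n + "-" * (12 - n) + "TTGA" for n in (9, 10, 8, 12)]
flagged = variable_mononucleotide_runs(example_1)
```

Columns 2 through 13 of Example 1 are flagged; whether to exclude them or code them as a gap character remains a judgment for the researcher.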

Insertions can also be multinucleotide repeat
units of a neighboring sequence, as demonstrated
in Example 2 by the inserted repeat unit ataaa
(“Type 1b” gap; Golenberg et al., 1993; Hoot &
Douglas, 1998).

EXAMPLE 2.

1. ATAAAACAAA-----GAGCG
2. ATAAAATAAAataaaGAGCG
3. ATAAAATAAA-----GAGCG
4. ATAAAATAAA-----GAGCG


An inserted repeat of this nature could be exten- 
sive in length and may be difficult to recognize as 
a repeat unit during alignment (for example, I have 
identified a 73 bp inserted repeat [unpublished 
data] in the trnT-trnL intergenic spacer in Myopor- 
aceae). A repeat unit by its very nature shares nu- 
cleotide content and order with flanking sequence; 
therefore, multiple gaps may be inferred by pairing 
segments of an inserted repeat with its progenitor 
sequence. This is particularly problematic if the in- 
sertion or its progenitor has experienced subse- 
quent nucleotide substitutions. 

Even when a single gap is inferred, positioning 
of the gap may hide evidence that the insertion is 
a repeat unit. Example 3 is reproduced from Kel- 
chner and Clark (1997) and demonstrates how a 
repeat unit may be obscured in a sequence matrix. 


EXAMPLE 3.

A.
1. GGTTATGA ----- ATTAACA
2. GGTTATAA ----- ATTAACA
3. GGTTATAA tataa ATTAACA
4. GGTTATAA tataa ATTAACA
B.
1. GGTTAT ----- GA ATTAACA
2. GGTTAT ----- AA ATTAACA
3. GGTTATAA tataa ATTAACA
4. GGTTATAA tataa ATTAACA
C.
1. GGTTA ----- TGA ATTAACA
2. GGTTA ----- TAA ATTAACA
3. GGTTATAA tataa ATTAACA
4. GGTTATAA tataa ATTAACA


Alignment possibilities A, B, and C were equally 
probable using CLUSTAL W (Thompson et al., 
1994). Only alignment A reveals the insertion is a
repeat unit, a common mutation type in non-coding
regions. If alignment options B or C were used
for phylogenetic analysis, the content of the inser- 
tion would be of unexplainable origin (though still 
possible) and the potential of incorrectly assessing 
nucleotide homology in the region may be consid- 
erable. 
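
The situation in Example 3 can be reproduced with a short sketch: the three gap placements receive the same score under a simple affine gap scheme, yet only placement A leaves an inserted segment that duplicates its flanking sequence. The scoring values and function names below are arbitrary assumptions for illustration; they are not the CLUSTAL W parameters.

```python
def gap_span(row):
    """Return (start, end) columns of the first gap run in an aligned row."""
    start = row.index("-")
    end = start
    while end < len(row) and row[end] == "-":
        end += 1
    return start, end


def is_flanking_repeat(gapped_row, full_row):
    """True if the bases opposite the gap duplicate the sequence beside them."""
    start, end = gap_span(gapped_row)
    size = end - start
    full = full_row.upper()
    inserted = full[start:end]
    return inserted in (full[max(0, start - size):start], full[end:end + size])


def pair_score(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned rows; the first gap column costs gap_open."""
    score, in_gap = 0, False
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


# Sequence 3 of Example 3 and the three placements of sequence 1.
full_row = "GGTTATAAtataaATTAACA"
placements = {
    "A": "GGTTATGA-----ATTAACA",
    "B": "GGTTAT-----GAATTAACA",
    "C": "GGTTA-----TGAATTAACA",
}
scores = {k: pair_score(row, full_row) for k, row in placements.items()}
repeats = {k: is_flanking_repeat(row, full_row) for k, row in placements.items()}
```

Because the score cannot separate the three placements, a check of this kind for a neighboring progenitor sequence is one way to choose among them on mechanistic grounds.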

Any of the gap positions in this particular ex- 
ample would not affect a topology generated from 
these four taxa, but gap positioning may have a 
significant effect in a larger matrix of more distantly 
related taxa. The position of the gap in alignment 
3A and detection of the repeat unit may also be 
relevant in determining a weighting scheme for 
these non-independent characters. 

Length mutations may overlap with one another 
to create a progressive-step indel. In the more ex- 
treme cases, appraisal of homology in these regions 
can be very difficult, or impossible (Palmer et al., 
1985; Downie et al., 1996b; Kelchner & Clark, 
1997). Example 4 demonstrates a probable pro- 
gressive-step indel in which two possible place- 
ments exist for the repeat TTGA. Note that the un- 
derlined sequence is a direct repeat of the 
preceding sequence TCGTAATTGA in the matrix. 


EXAMPLE 4.

1. AATCGTAATTGA ---------- ---- AACAGA
2. AATCGTAATTGA ---------- ---- AACAGA
3. AATCGTAATTGA TCGTAATTGA ---- AACAGA
4. AATCGTAATTGA TCGTAATTGA ---- AACAGA
5. AATCGTAATTGA TCGTAATTGA ttga AACAGA


If part of the underlined TTGA in sequences 3 
and 4 is moved from its current position to align 
with ttga in sequence 5, the possibility that the 
ttga sequence is a direct repeat of the preceding 
sequence may be obscured; however, this alignment 
choice would not be impossible. As the preceding 
sequence to the underlined 10 bp repeat does not 
contain this additional ttga repeat, we can infer 
that two separate events have given rise to an initial 
10 bp insertion in sequences 3 and 4, followed by 
an additional 4 bp insertion in sequence 5. Wheth- 
er ttga itself or the preceding TTGA is the sub- 
sequent inserted mutation is impossible to deter- 
mine. In this case, either alternative alignment of 
the TTGA unit would cause no effect in a phylo- 
genetic analysis; it is most important here to dis- 
cern the two length mutation events. If any poten- 
tially informative nucleotide substitutions were 
present in either of the repeat units in Example 4, 
these substitutions should be excluded from a phylogenetic
analysis on the basis that nucleotide homology
of the repeats is not discernable.

The example above suggests that homology may 
be indicated by the length of insertions or deletions 
in a gap, although such an assumption is not with- 
out risk. Example 5 below demonstrates multiple 
possible alignments of the gatt repeat unit (repre- 
sented individually by sequences 2, 3, and 4) with 
the insertion in sequence 1. 


EXAMPLE 5.

1. CAGATTGATTGATTATTATACTGATTATGC
2. CAGATT----------------gattATGC
3. CAGATTgatt----------------ATGC
4. CAGATT----gatt------------ATGC
5. CAGATT--------------------ATGC


Again, actual homology is impossible to assess 
with confidence, for there exist three GATT repeat 
units in the insertion in sequence 1. In cases like 
this, homology is often inferred on the basis of 
length of indel and minimum number of gaps re- 
quired to position the repeat. Hence, the gatt re- 
peats in sequences 2, 3, and 4 would be aligned 
one above the other and on one side of the gap to 
reduce the number of inferred indel events. When 
coding indels as characters, this would be a rea- 
sonable solution in lieu of other evidence for indel 
origin, and the repeat gatt would be treated as ho- 
mologous for those sequences that contain it. 

Equal length of insertions may not be strong evidence
of their homology (Kelchner, 1996; Kelchner
& Clark, 1997; Hoot & Douglas, 1998). Consider
the insertions in Example 6A.

EXAMPLE 6A.

1. GGTTAAT tctat TCTATCT
2. GGTTAAT ttaat TCTATCT
3. GGTTAAT ttaat TCTATCT
4. GGTTAAT ----- TCTATCT
5. GGTTAAT ----- TCTATCT


Alignment of the insertions in Example 6A re- 
sults in the probably mistaken homology of indels 
in sequences 2 and 3 with that of sequence 1. The 
insertion in sequence 1 likely arose from an in- 
serted repeat of the sequence to the right of the 
gap, TCTAT. This would be a more parsimonious 
explanation, in terms of total number of mutation 
events, than to infer a single inserted repeat fol- 
lowed by two adjacent nucleotide substitutions in 
sequence 1. Sequences 2 and 3 probably share a
similar origin as a repeat of the preceding sequence 
TTAAT. The events, aligned as they are in Example 
6A, are probably non-homologous. A re-alignment 
could be performed to accommodate the two sepa- 
rate indel events (Example 6B), even though the 
insertions are of the same length and the alignment 
infers an additional gap (see Hoot & Douglas, 
1998). 


EXAMPLE 6B.

1. GGTTAAT ----- tctat TCTATCT
2. GGTTAAT ttaat ----- TCTATCT
3. GGTTAAT ttaat ----- TCTATCT
4. GGTTAAT ----- ----- TCTATCT
5. GGTTAAT ----- ----- TCTATCT


There is a hazard that minute inversions (Kel- 
chner & Wendel, 1996) can be completely ob- 
scured in a matrix if they introduce no gaps during 
alignment, particularly if alternative gap-weighting 
schemes have not been rigorously pursued. If pre- 
sent and unrecognized in a data matrix, minute in- 
versions may overweigh a particular mutation by 
interpreting the single mutation event (an inversion) 
as multiple apomorphies of adjacent nucleotide 
substitutions. Example 7 below illustrates a situation
in which sequences 2 and 3 share the inversion
TTGG to CCAA (from Kelchner & Wendel, 1996).

EXAMPLE 7.

1. TAATATT TTGG AATATTA
2. TAATATT CCAA AATATTA
3. TAATATT CCAA AATATTA
4. TAATATT TTGG AATATTA
5. TAATATT TTGG AATATTA


If the inversion is of sufficient length to introduce 
multiple gaps in the matrix (see Golenberg et al., 
1993; Sang et al., 1997), two possibilities can oc- 
cur: the gaps will be misaligned to parts of the in- 
verted sequence sharing spurious sequence simi- 
larity with the uninverted sequences; or, there will 
be inference of an inserted sequence of unknown 
origin (in reality, the inverted nucleotides), which 
corresponds with a deletion in the homologous un- 
inverted sequences. Each possibility will lead to 
inaccurate assessment of homology and may poten- 
tially have a considerable effect on phylogeny es- 
timation. 
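
A matrix can be screened for minute inversions of the kind shown in Example 7 by testing each run of adjacent mismatches for a reverse-complement relationship. The sketch below is a minimal illustration under stated assumptions: the function names and `min_len=2` are hypothetical, sequences are assumed to be upper-case and already aligned, and an inversion containing positions that happen to match after inversion would be missed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def minute_inversions(ref, other, min_len=2):
    """Return (start, end) spans where `other` is the reverse complement of `ref`.

    Only runs of consecutive mismatching, ungapped columns are tested.
    """
    hits, start = [], None
    for i in range(len(ref) + 1):
        differs = (
            i < len(ref)
            and ref[i] != other[i]
            and "-" not in (ref[i], other[i])
        )
        if differs and start is None:
            start = i
        elif not differs and start is not None:
            if i - start >= min_len and other[start:i] == revcomp(ref[start:i]):
                hits.append((start, i))
            start = None
    return hits


# Sequences 1 and 2 of Example 7, written without the spacing.
seq_1 = "TAATATTTTGGAATATTA"
seq_2 = "TAATATTCCAAAATATTA"
inversions = minute_inversions(seq_1, seq_2)
# The flanks are themselves inverted repeats, consistent with a hairpin stem.
stem_pairs = revcomp("TAATATT") == "AATATTA"
```

A span reported here would be scored as four separate substitutions if left in the matrix, rather than as one inversion event.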

Regions in the matrix demonstrating many in- 
dependent variable-length insertion and deletion 
events will likely be associated with secondary
structures, specifically with loop regions of stem- 
loops (Kelchner, 1996; Downie et al., 1996b; Kel- 
chner & Clark, 1997). Identification of flanking se- 
quences involved in possible stem formation could 
locate the boundaries for the region and aid in 
aligning the indels. Discerning probable SSM-sus- 
ceptible sites can also be informative for the infer- 
ence of parallel and reversed insertions or dele- 
tions. 

Perhaps methods of gap or character weighting 
and alignment based on mechanisms of mutation 
can be incorporated into software designed for non-coding
sequence alignment, particularly by including an evaluation of ΔG
values for probable secondary structures. However, the diversity of rates
and types of molecular evolution in non-coding re- 
gions may be profound. As with coding DNA, we 
are far from understanding all forces directing non- 
coding molecular evolution to a degree that we can, 
with any certainty, assign probabilities to individual 
mutations. 

Considering that alignment of sequence data is 
fundamental to the entire phylogeny estimation pro- 
cess, authors should more fully describe the steps 
taken to align their sequence data in order to pro- 
vide necessary information for the assessment of 
their proposed reconstructions of phylogenies. 


ANALYSIS OF NON-CODING SEQUENCE DATA 


The mechanisms of evolution described above 
have a number of significant implications for the 
phylogenetic analysis of non-coding sequence data. 
Among these are the following: 

(1) Slipped-strand mispairing can be the result 
of persistent mutational triggers (especially when 
the trigger sequence is located in the stem of a 
stem-loop secondary structure). This can introduce 
homoplasy from parallelisms and reversals into any 
phylogenetic estimations that include gap-coded 
characters in the matrix. Multiple indel events in a 
localized region may obscure homology of length 
mutations. Non-independence of these mutations 
introduces the issue of relative weighting of nucle- 
otide characters linked in a repeat unit, if each 
base is treated as a character in an analysis. Weight 
of the unit taken as a single character is also an 
issue if the unit is included in the analysis as a 
coded gap character. 

(2) Secondary structure shows nonrandom mu- 
tation in the form of compensatory mutation and 
possible homogenization of sequence necessary for 
stem formation. Loop sequence is available for multiple
mutations in the form of inversions, length
mutations, and multiple-hit point substitutions, any
of which may obscure evolutionary history.

(3) Inversions may show high levels of parallel- 
ism and reversal, and their phylogenetic utility may 
not be particularly robust. Undetected minute in- 
versions may be buried within a data matrix and 
consequently treated as multiple base substitution 
synapomorphies instead of a single mutational 
event. 

(4) Nucleotide substitutions may be under pe- 
culiar constraints not fully understood. There is ev- 
idence of a bias in non-coding regions involving 
transition/transversion substitution ratios due to the 
influence of neighboring bases. A particular base 
may experience substitution events multiple times 
in closely related lineages, reaching saturation long 
before the expected saturation level for the remain- 
ing sequence. A base-composition bias toward A/T 
content is clearly present in non-coding cpDNA. 

Selective pressures exerted on non-coding re- 
gions may be largely a function of the physical 
structure of the sequence and possible functionality 
of introns and intergenic spacers. Reliance on 
methodology developed for coding sequence, which 
includes estimates of constraints on coding se- 
quence evolution, transition/transversion ratios, and 
mutation probabilities, is inappropriate for the 
analysis of non-coding regions. 

Phylogenetic estimations based on genetic dis- 
tance measures of non-coding cpDNA sequences 
must be approached with care. Superficial appli- 
cation of models for maximum likelihood (ML; Fel- 
senstein, 1981) or neighbor-joining (NJ; Saitou & 
Nei, 1987) could easily produce erroneous phylo- 
genetic estimations if several key assumptions un- 
derlying the methodology are violated. 

For example, most models consider a nucleotide 
site as the unit of evolution (Ritland & Eckenwald- 
er, 1992), a consideration that is contradicted by 
the mode of non-coding sequence evolution. Sim- 
plistic models based on the commonly calculated 
Kimura estimates (Kimura, 1980) and Jukes-Cantor 
estimates (Jukes & Cantor, 1969) assume an equal 
25% frequency for each nucleotide type throughout 
the sequence and generate base mutation proba- 
bilities from this assumption. Because non-coding 
cpDNA regions can demonstrate much higher A/T 
content, this assumption is clearly contradicted. 
Furthermore, transition/transversion ratios in non- 
coding regions can differ considerably from coding 
ones (see Hoot & Douglas, 1998), and may even 
vary between discrete non-coding regions of the 
chloroplast genome. Among-site mutation rate het- 
erogeneity is highly probable, especially if regions 
of conservation and hot spots for mutation exist in 
the data. The presence of multiple gaps in an
aligned matrix presents an additional hurdle for 
distance analysis, and indels themselves are diffi- 
cult to incorporate as additional characters. 
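
Whether a data set departs from the equal 25% base frequencies assumed by the simplest models is easy to check directly. The following is a minimal sketch; the function names are assumptions, and the test sequence is sequence 2 of Example 2 with the insertion included.

```python
from collections import Counter


def base_frequencies(seqs):
    """Observed frequency of each base, ignoring gaps and ambiguity codes."""
    counts = Counter(b for s in seqs for b in s.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def at_content(seqs):
    """Combined frequency of A and T."""
    freqs = base_frequencies(seqs)
    return freqs["A"] + freqs["T"]


freqs = base_frequencies(["ATAAAATAAAataaaGAGCG"])
at = at_content(["ATAAAATAAAataaaGAGCG"])
```

For this short fragment the A/T content is 0.8, far from the 0.5 implied by equal base frequencies.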

Countering such complications can be involved 
and computationally demanding. Modification of 
the initial Jukes-Cantor estimates to allow for vary- 
ing base frequencies (e.g., Tajima & Nei, 1984) 
should be employed. Transition/transversion ratios 
can be estimated directly from the non-coding se- 
quence matrix by pairwise sequence comparisons 
(e.g., Yang & Yoder, 1999), eliminating the circu- 
larity occasioned by measures derived from a to- 
pology. More refined distance models that incor- 
porate these problems stand a better chance of 
reflecting the underlying manner of molecular evo- 
lution in non-coding sequence data. Such refined 
models may therefore estimate a more accurate 
phylogeny that better recovers the evolutionary his- 
tory of the characters. 
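
The corrections discussed above can be illustrated with a short sketch. It counts transitions and transversions by pairwise comparison and compares the Jukes-Cantor distance with an equal-input distance that uses observed base frequencies. The equal-input formula is offered in the spirit of the modification cited above and is not claimed to be the exact estimator of Tajima and Nei (1984); all function names and the example values are assumptions.

```python
import math

PURINES = {"A", "G"}


def pairwise_counts(a, b):
    """Return (sites compared, transitions, transversions), skipping gaps."""
    sites = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x != y:
            if (x in PURINES) == (y in PURINES):
                ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
            else:
                tv += 1
    return sites, ts, tv


def jc69(p):
    """Jukes-Cantor distance from the proportion of differing sites p."""
    return -0.75 * math.log(1 - 4 * p / 3)


def equal_input(p, freqs):
    """Distance under unequal base frequencies (equal-input model)."""
    b = 1 - sum(f * f for f in freqs.values())
    return -b * math.log(1 - p / b)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # hypothetical frequencies
counts = pairwise_counts("ACGTACGTAC", "ATGAACGTA-")
```

With uniform frequencies the two distances coincide; with the A/T-rich frequencies the same observed difference implies a somewhat larger distance, which is the direction of the bias the Jukes-Cantor assumption would introduce.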

With ML, transition/transversion estimates are 
dependent on whether among-site rate variation has 
been incorporated in the model and can be sensi- 
tive to the accuracy of the topology used for their 
estimation (Sullivan et al., 1996). Among-site rate 
heterogeneity in the data is often assumed to fit 
either a negative binomial or gamma distribution 
function, and confirmation can be assessed statis- 
tically. Such rate heterogeneity is likely present in 
non-coding sequence data due to the effects of sec- 
ondary structure on mutation likelihoods. Rates of 
variation at sites are usually expected to fit a gamma
distribution model (Yang, 1996), and a parameter
(α) can be determined to define the shape of
that underlying function in an ML analysis (see
Yang (1994) and Yang (1996) for thorough explanation).
However, Sullivan et al. (1996) suggested
α estimates are strongly affected by the topology
used for their estimation. Therefore, to improve the
ability of a model incorporating gamma distribution
to recover the “correct” phylogeny, α must be calculated
directly from the data matrix; this should
be done by pairwise comparison, which can be a
computationally intensive or even impossible procedure
as the number of taxa increases in the matrix
(Yang, 1996; Sullivan et al., 1996). Poor estimation
of α can easily result in a misleading
phylogenetic hypothesis (Yang, 1996; Sullivan et
al., 1996).
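
For reference, the gamma model of among-site rate variation is usually written with the mean rate fixed at 1, so that a single shape parameter governs the distribution. The parameterization below is the common convention and is assumed here rather than taken from the text; r denotes the relative rate at a site.

```latex
f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} e^{-\alpha r},
\qquad r > 0,
\qquad \mathrm{E}[r] = 1,
\qquad \mathrm{Var}[r] = \frac{1}{\alpha}
```

Small values of the shape parameter therefore correspond to strong rate heterogeneity (most sites nearly invariant and a few changing rapidly), while large values approach equal rates across sites.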

Other problems associated with non-coding 
cpDNA sequence data may be very difficult to ad- 
dress. If at least some of the mutation in non-coding 
sequences occurs in linked units, then the non- 
independence of these nucleotide characters di- 
rectly affects the subsequent analysis. At present, 
there is no reliable parameter estimate to incorporate
such non-independent characters in a distance
model. Most work on parameter estimates for mod- 
els has been based on coding sequence observa- 
tions, and thus may not reflect the unique aspects 
of molecular evolution in non-coding regions. 

Determining probabilistic estimates for non-cod- 
ing cpDNA mutations is, at this time, difficult; 
therefore, the accurate assessment of the underly- 
ing mode of evolution for maximum likelihood anal- 
ysis may be impossible. As Yang et al. (1995) dis- 
cussed in detail, the accuracy of ML in recovering 
an evolutionary history is strongly dependent on the 
evolutionary model applied. Thus, for non-coding 
cpDNA sequence data (as well as genic sequence 
data), deeper understanding of the manner of evo- 
lution in these regions is required before an accu- 
rate model for ML phylogenetic analysis can be ap- 
plied. 

The frequent alternative to distance measures 
and maximum likelihood is parsimony analysis. 
Heuristic parsimony searches can be considerably 
faster and less computationally intensive than a 
maximum likelihood analysis with the parameter 
adjustments described above; however, they are of- 
ten much slower than a distance analysis. Parsi- 
mony analyses that contain no weighting schemes 
for transition/transversion bias and non-indepen- 
dent mutation of matrix characters may be as vul- 
nerable to recovery of an inaccurate phylogeny as 
similarly simplistic distance models. It has been 
suggested that parsimony’s potential in some cases 
to recover a correct topology decreases significantly 
when among-site rate heterogeneity exists in the 
data (Tateno et al., 1994; Kuhner & Felsenstein, 
1994; Huelsenbeck, 1995). Such rate heterogeneity 
could arise from the structured sequence patterns 
described here in non-coding cpDNA. And though 
it has been proposed that the reliability of parsi- 
mony estimates increases with increasing number 
of taxa included in an analysis (e.g., Wakeley, 1993; 
Sullivan et al., 1995; Yang, 1996), it is unclear if 
this effect is independent of possible among-site 
rate variation. 

Parsimony specifies no particular probabilistic 
evolutionary model, but like all phylogenetic esti- 
mation methods it is influenced by non-indepen- 
dence of characters. This problem can be alleviated 
to a degree if mutations such as inversions and in- 
serted or deleted repeats are recognized as non- 
independent events and are either excluded from 
the analysis or coded separately as described be- 
low. Any non-independent evolution of neighboring 
nucleotides in a sequence would create an artificial 
weighting effect for these positions in a parsimony 
analysis that considers each nucleotide an independently
evolving character.

Various weighting schemes have been proposed 
to counter this effect. Weighting has been applied, 
for example, to compensatory mutations associated 
with secondary structure in rDNA (e.g., Wheeler & 
Honeycutt, 1988; Dixon & Hillis, 1993; Baldwin et 
al., 1995; Soltis et al., 1997; Soltis & Soltis, 1998). 
Trial weighting schemes have also been applied to 
non-coding sequence data from the chloroplast 
(e.g., Downie et al., 1996a; Liden et al., 1997). 
However, Olmstead et al. (1998) reasoned that an 
erroneous weighting model increases the chance 
that the correct topology is excluded from the most 
parsimonious topologies recovered. In their opin- 
ion, a more general model such as equal weighting 
of characters may limit resolution, but would in- 
crease the chance that the “true” tree is recovered 
by the analysis. Development of defensible weight- 
ing schemes for non-coding sequence data would 
necessarily come from evidence provided by com- 
parative analysis of non-coding regions throughout 
the chloroplast genome, and may be specific to in- 
dividual data sets. The likelihood of misdiagnosing 
an appropriate weighting scheme for subsets of the 
data may still be high. Therefore, it is perhaps sen- 
sible for now to apply equal weighting to non-cod- 
ing sequence characters until we have further evi- 
dence to support a particular weighting scheme. 

Insertions and deletions have been shown to be 
of considerable phylogenetic value (e.g., Golenberg 
et al., 1993; Mes & Hart, 1994; Natali et al., 1995; 
Downie et al., 1996a; Kelchner & Clark, 1997; Ox- 
elman et al., 1997; Sang et al., 1997; Liden et al., 
1997; Downie et al., 1998; Bayer & Starr, 1998), 
and one should consider including gaps as coded 
(present/absent) characters appended to the se- 
quence matrix (e.g., Hodges & Arnold, 1994; Kel- 
chner & Clark, 1997; Sang et al., 1997; Downie et 
al., 1998; Hoot & Douglas, 1998; Bayer & Starr, 
1998). Selection of gaps to be included in the anal- 
ysis, however, is somewhat subjective in that opti- 
mally only those length mutations arguably homol- 
ogous based on size, composition, and related 
mechanistic origin should be included. 
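
Appending coded gap characters to a matrix can be sketched as follows. The researcher supplies the column spans judged to represent separate indel events, which keeps the subjective homology decision explicit. The function name, the 0/1/? state symbols, and the use of Example 6B as input are assumptions for illustration.

```python
def code_indels(rows, events):
    """Append one present/absent character per indel event to each row.

    `events` is a list of (start, end) column spans, each judged by the
    researcher to be a single insertion or deletion event.
    """
    coded = []
    for row in rows:
        states = ""
        for start, end in events:
            segment = row[start:end]
            if set(segment) == {"-"}:
                states += "0"  # inserted sequence absent
            elif "-" not in segment:
                states += "1"  # inserted sequence present
            else:
                states += "?"  # partly gapped: homology unclear
        coded.append(row + states)
    return coded


# Example 6B without spacing: two separate 5 bp insertion events.
example_6b = [
    "GGTTAAT-----tctatTCTATCT",
    "GGTTAATttaat-----TCTATCT",
    "GGTTAATttaat-----TCTATCT",
    "GGTTAAT----------TCTATCT",
    "GGTTAAT----------TCTATCT",
]
matrix = code_indels(example_6b, [(7, 12), (12, 17)])
```

Coded this way, the insertion in sequence 1 and the insertion shared by sequences 2 and 3 become separate characters instead of being scored as one homologous event.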

The exclusion of gaps and removal of coded gap 
characters from a non-coding sequence matrix can 
be an interesting and informative approach to 
studying the degree of resolution provided by point 
substitution information alone (e.g., Kelchner, 
1996; Kelchner & Clark, 1997). A similar analysis 
can be conducted by including coded gap charac- 
ters only and excluding all other characters in the 
matrix. Coupled with mapping characters onto a topology
produced from a complete matrix, these partitioned
analyses may prove useful in locating and
determining the degree of problematic homoplasy 
affecting resolution in competing topologies. 

Minute inversions should be identified and re- 
moved from the analysis, to be added as present/ 
absent characters at the end of the matrix (Kelchner 
& Wendel, 1996; Kelchner & Clark, 1997). This 
eliminates potential scoring of multiple non-homol- 
ogous synapomorphies that are artifacts of an in- 
version mutation. 

Of some concern is the tendency to treat nucle- 
otide gap characters of taxa that do not share an 
insertion (i.e., have only spaces present at the in- 
sertion position in the matrix) as missing characters 
when conducting parsimony analysis. This results 
in inferred nucleotide homology for characters in 
the inserted sequences, which leads to cladistic as- 
sessment of their base substitutions. Such an ap- 
proach should be applied only when evidence of 
the homology of inserted sequences is convincing. 
Chaotic regions or other areas where homology as- 
sessment is deemed impossible should be excluded 
from the data matrix before analysis (see Liden et 
al., 1997) to avoid this mistaken claim of nucleotide 
homology. 

Bootstrap (Felsenstein, 1985) and jackknife 
(Farris et al., 1997) analyses, frequently misunder- 
stood to be direct measures of phylogenetic accu- 
racy, are only as sound as their underlying analysis 
procedure. As with coding sequences (see Trueman, 
1993; Hillis & Bull, 1993; Bremer, 1994; Mishler, 
1994; Brown, 1994), both support measures can be 
affected by the non-independent structure present 
in non-coding sequences. The structure invalidates 
a requirement of the statistic that each nucleotide 
be a discrete and independent character. 

Bootstrap and jackknife analyses are a re-sam- 
pling of the data matrix in an effort to statistically 
measure how robustly the data in the matrix sup- 
port a particular topology. The concept is sound, 
but the statistical integrity of both measures relies 
on the assumption that each nucleotide is an in- 
dividual character, that each character evolves ran- 
domly and independently, and that the matrix rep- 
resents a sample of a much larger population of 
characters evolving in identical fashion (Felsen- 
stein, 1985). Due to the non-independent structure 
existing in non-coding regions, and the probably 
unique series of evolutionary constraints acting not 
only on individual non-coding regions but also on 
partitions of a region, each of these assumptions 
may be violated. Sampling from within such a data 
set equates to sampling a nonrandom and non-in- 
dependent subset of a non-existing larger popula- 
tion. A large number of bootstrap replicates should, 
in theory, cover all possible error due to reduced
character sampling in each replicate, but the 
strength of the bootstrap test is weakened if the 
characters are not accurately defined. If a character 
in some cases is not an individual nucleotide but 
a suite of nucleotides, the conditions that would 
make bootstrapping and jackknifing accurate as 
measures of data support for a topology are not sat- 
isfied. An analysis would produce an unequal 
weighting effect on subsets of the data in each re- 
sampling due to the frequent localized violation of 
character definition. 
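
The effect of character definition on resampling can be made concrete with a small sketch in which columns known to be linked (for example, the bases of one inserted repeat) are drawn together as a single character. This is a hypothetical illustration of the problem, not a published resampling procedure; the function name and arguments are assumptions.

```python
import random


def bootstrap_columns(n_cols, linked_units=(), rng=None):
    """Draw one bootstrap pseudoreplicate as a list of column indices.

    Columns named in `linked_units` are resampled together as one
    character; all other columns are resampled individually.
    """
    rng = rng or random.Random()
    linked = {c for unit in linked_units for c in unit}
    units = [tuple(u) for u in linked_units]
    units += [(c,) for c in range(n_cols) if c not in linked]
    drawn = [rng.choice(units) for _ in units]  # sample units with replacement
    return [c for unit in drawn for c in unit]


# A 20-column matrix in which columns 8-12 form one inserted repeat unit.
replicate = bootstrap_columns(20, [(8, 9, 10, 11, 12)], random.Random(1))
```

Under standard column-by-column resampling, the five linked columns would instead be drawn as five independent characters, which is the unequal weighting effect described above.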

A non-resampling technique that allows assess- 
ment of data support for individual clades is the 
Bremer Support measure (BS, or “decay” analysis; 
Bremer, 1988, 1994; Donoghue et al., 1992; for 
application to large data sets, see Baum et al., 
1994; Morgan, 1997). The measure is a function 
only of the recoverability of clades in topologies 
progressively one step longer. Bremer support has 
the possibility of sidestepping the effects of char- 
acter definition issues discussed above for boot- 
strapping if the model underlying the phylogeny es- 
timation considers the variable nature of character 
definition in a nucleotide set. 
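
Stated as a formula, the support value for a clade is the number of extra steps needed before that clade is no longer recovered. The notation below is assumed for illustration: L_min is the length of the most parsimonious tree(s), and L_{\neg C} is the length of the shortest tree that does not contain clade C.

```latex
b(C) = L_{\neg C} - L_{\min}
```

A value of b(C) = 1 thus means the clade collapses in trees only one step longer than the shortest.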

Oxelman et al. (1999) demonstrated that boot- 
strapping and BS evaluate different parameters of 
the data matrix, and are thus not directly compa- 
rable measures (though BS values, when high, may 
be imperfectly correlated with bootstrap and jack- 
knife values). BS values cannot be viewed as prob- 
abilistic estimates themselves (Oxelman et al., 
1999), and an inability to adapt the measures to a 
standard scale that is universally applicable ren- 
ders the technique of dubious worth to some sys- 
tematists. However, the innovation by Oxelman et 
al. (1999) that includes minimal branch length val- 
ues with each BS value does, in a non-standard 
way, improve the comparative information capacity 
of the measure. This procedure may be more mean- 
ingful and informative than bootstrap and jackknife 
values for non-coding cpDNA data. 


CONCLUSIONS 


In summary, great care should be given to the 
alignment and assessment of non-coding sequence 
data. There is considerable evidence now that non- 
coding regions are highly structured, non-randomly 
evolving DNA; thus, alignment by current random- 
ized algorithmic software is rarely adequate. An un- 
derstanding of the proposed mechanisms of muta- 
tion acting on non-coding sequences is critical for 
the positioning of gaps and the better assessment 
of homology of indels and point substitutions. Probable
secondary structure should be routinely identified
and used as an important source of information
to aid in aligning chaotic or labile regions of
the data matrix. Prior to phylogenetic analysis, all 
matrices should be carefully reviewed for obscured 
mutational events, such as minute inversions or 
misaligned repeat units. 

Important for understanding molecular evolution 
in non-coding DNA is the concept of the mutational 
trigger (Kelchner, 1996; Kelchner & Clark, 1997), 
a specific sequence pattern that creates the foun- 
dation for a mutational event. Such triggers often 
remain intact after generating a mutation, and their 
presence can easily occasion a repeated, paralleled, 
or reversed mutation event. Triggers are likely
responsible for much of the homoplasy of gap characters
inferred in studies at any taxonomic level;
those applying non-coding sequence data to molec- 
ular systematics should be aware of their occur- 
rence and effect. 

Information of the kind presented here can in- 
crease the predictive value of mutational events in 
non-coding DNA. For example, Kelchner and Wen- 
del (1996) suggested that minute inversions asso- 
ciated with hairpin secondary structures described 
in non-coding cpDNA could occur in similar situ- 
ations in other genomes. Dumolin-Lapègue et al. 
(1998) recently reported just such an event in the 
mitochondria of oak populations of southern 
France. Hence, recommendations proposed in this 
paper for the phylogenetic analysis of non-coding 
cpDNA sequences probably also apply to data from 
non-coding regions of nuclear, and particularly mi- 
tochondrial, genomes. 

Choosing an appropriate non-coding region for a 
particular taxonomic level is essential for maximiz- 
ing its utility as a phylogenetic tool, but there is no 
infallible method for determining what the “prop- 
er” degree of mutation is for a particular study. A 
region’s utility may vary between plant groups that 
are assumed to occupy the same evolutionary level, 
and data from multiple non-coding regions, when 
applied to one taxonomic group, can vary remark- 
ably in phylogenetic utility (see Small et al., 1998). 
In light of the mutational mechanisms outlined in 
this article, at least one concern seems justified: if 
the taxonomic level is too high, one would expect 
saturation of multiple hit sites and concealment of 
multiple hit indels in any non-coding region, de- 
creasing its utility as a phylogenetic tool. 

The perceived intricacies of molecular evolution 
and their bearing on phylogenetic analysis, both in 
non-coding and coding regions (for genes have 
well-known mechanistic biases as well—the codon 
position being just one example) can be discour- 
aging. However, the phenomena outlined in this ar- 
ticle have solutions in most cases, and attention to 
alignment and analysis should enhance the phylo- 
genetic utility and accuracy of non-coding cpDNA 
data. It should be noted that in almost all system- 
atic studies based on non-coding cpDNA sequenc- 
es, the authors profess to have found sufficient phy- 
logenetic information in their data to warrant its use 
in lower-level phylogenetic analyses. 

Clearly there is a need to develop an understand- 
ing of molecular evolution in non-coding cpDNA 
regions similar to that which exists for chloroplast 
genic DNA. Continued research into non-coding se- 
quence evolution may eventually produce a more 
balanced process for the alignment and phyloge- 
netic analysis of non-coding sequence data. Future 
software may be able to measure and assess prob- 
abilities associated with particular mutational 
mechanisms and incorporate this information into 
the alignment process. This would be an immense 
aid to those systematists who wish to apply non- 
coding molecular tools to the field of plant system- 
atics. 